| Date | Topic |
|---|---|
| 16.11.2023 | Data preparation and manipulation |
| 23.11.2023 | Basic statistics and data analysis with R |
| 23.11.2023 | Exercises/Workshop 4: Data gathering, data import |
| 30.11.2023 | Guest Lecture: Matteo Courthoud (Senior Economist and Data Scientist @Zalando) |
| 07.12.2023 | Visualisation, dynamic documents |
| 07.12.2023 | Exercises/Workshop 5: Data preparation and applied data analysis with R |
| 14.12.2023 | Guest Lecture: Florian Chatagny (Head of Data Science @Federal Finance Administration in Bern) |
| 21.12.2023 | Exercises/Workshop 6: Visualization, dynamic documents |
| 21.12.2023 | Summary, Wrap-Up, Q&A, Feedback |
| 21.12.2023 | Exam for Exchange Students |
Rectangular data
Non-rectangular data
Tell your future self what this script is all about
#######################################################################
# Project XY: Data Gathering and Import
#
# This script is the first part of the data pipeline of project XY.
# It imports data from ...
# Input: links to data sources (data comes in ... format)
# Output: cleaned data as CSV
#
# A. Sallin, St. Gallen, 2023
#######################################################################

# SET UP --------------
# load packages
library(tidyverse)

# set fixed variables
INPUT_PATH <- "/rawdata"
OUTPUT_FILE <- "/final_data/datafile.csv"

# IMPORT RAW DATA FROM CSVs -------------
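The import step announced by the last section header could look roughly like the sketch below. It is only an illustration: the temporary folder, the file names `part1.csv` and `part2.csv`, and their columns are hypothetical stand-ins for whatever actually sits behind `INPUT_PATH`.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for INPUT_PATH: a temporary folder with two small CSVs
input_path <- file.path(tempdir(), "rawdata_example")
dir.create(input_path, showWarnings = FALSE)
write_csv(data.frame(id = c(1, 2), grade = c(5, 4.5)), file.path(input_path, "part1.csv"))
write_csv(data.frame(id = 3, grade = 4), file.path(input_path, "part2.csv"))

# List all CSV files in the folder, read each one, and stack them into one table
csv_files <- list.files(input_path, pattern = "\\.csv$", full.names = TRUE)
raw_data <- bind_rows(lapply(csv_files, read_csv, show_col_types = FALSE))
```

Keeping the folder in a single variable at the top of the script means only one line changes when the data moves.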
Be the JSON file
{
  "students": [
    {
      "id": 19091,
      "firstName": "Peter",
      "lastName": "Mueller",
      "grades": {
        "micro": 5,
        "macro": 4.5,
        "data handling": 5.5
      }
    },
    {
      "id": 19092,
      "firstName": "Anna",
      "lastName": "Schmid",
      "grades": {
        "micro": 5.25,
        "macro": 4,
        "data handling": 5.75
      }
    },
    {
      "id": 19093,
      "firstName": "Noah",
      "lastName": "Trevor",
      "grades": {
        "micro": 4,
        "macro": 4.5,
        "data handling": 5
      }
    }
  ]
}
Write R code that extracts a table whose first column contains the students' first names and whose second column contains each student's average grade. The table can be a data frame or a tibble.
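One possible approach, sketched under the assumption that the `jsonlite` package is available and that the JSON is held in a string (the variable names are illustrative, not part of the exercise):

```r
library(jsonlite)

# The JSON document from above, stored as a string for a self-contained example
students_json <- '{
  "students": [
    {"id": 19091, "firstName": "Peter", "lastName": "Mueller",
     "grades": {"micro": 5, "macro": 4.5, "data handling": 5.5}},
    {"id": 19092, "firstName": "Anna", "lastName": "Schmid",
     "grades": {"micro": 5.25, "macro": 4, "data handling": 5.75}},
    {"id": 19093, "firstName": "Noah", "lastName": "Trevor",
     "grades": {"micro": 4, "macro": 4.5, "data handling": 5}}
  ]
}'

# fromJSON() simplifies the array of student objects into a data frame;
# the nested "grades" objects become a data-frame column
students <- fromJSON(students_json)$students

# One row per student: first name and the mean over all grade columns
result <- data.frame(
  firstName = students$firstName,
  averageGrade = unname(rowMeans(students$grades))
)
```

Wrapping `result` in `tibble::as_tibble()` would give the tibble variant.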
<students>
  <student>
    <id>19091</id>
    <firstName>Peter</firstName>
    <lastName>Mueller</lastName>
    <grades>
      <micro>5</micro>
      <macro>4.5</macro>
      <dataHandling>5.5</dataHandling>
    </grades>
  </student>
  <student>
    <id>19092</id>
    <firstName>Anna</firstName>
    <lastName>Schmid</lastName>
    <grades>
      <micro>5.25</micro>
      <macro>4</macro>
      <dataHandling>5.75</dataHandling>
    </grades>
  </student>
  <student>
    <id>19093</id>
    <firstName>Noah</firstName>
    <lastName>Trevor</lastName>
    <grades>
      <micro>4</micro>
      <macro>4.5</macro>
      <dataHandling>5</dataHandling>
    </grades>
  </student>
</students>
<student id="19093" firstName="Noah" lastName="Trevor">
  <grades micro="4" macro="4.5" dataHandling="5" />
</student>
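The same table of first names and average grades can be built from the XML versions. The sketch below assumes the `xml2` package; the sample strings are shortened copies of the documents above and all variable names are illustrative.

```r
library(xml2)

# Element-based version (ids and last names omitted for brevity)
students_xml <- '<students>
  <student><firstName>Peter</firstName>
    <grades><micro>5</micro><macro>4.5</macro><dataHandling>5.5</dataHandling></grades>
  </student>
  <student><firstName>Anna</firstName>
    <grades><micro>5.25</micro><macro>4</macro><dataHandling>5.75</dataHandling></grades>
  </student>
  <student><firstName>Noah</firstName>
    <grades><micro>4</micro><macro>4.5</macro><dataHandling>5</dataHandling></grades>
  </student>
</students>'

doc <- read_xml(students_xml)
student_nodes <- xml_find_all(doc, ".//student")

# Values stored as child elements are read with xml_text() / xml_double()
first_names <- xml_text(xml_find_first(student_nodes, "./firstName"))
average_grades <- vapply(student_nodes, function(node) {
  mean(xml_double(xml_children(xml_find_first(node, "./grades"))))
}, numeric(1))
result_xml <- data.frame(firstName = first_names, averageGrade = average_grades)

# Attribute-based version: values are read with xml_attr() / xml_attrs()
attr_doc <- read_xml('<student id="19093" firstName="Noah" lastName="Trevor">
  <grades micro="4" macro="4.5" dataHandling="5" />
</student>')
attr_name <- xml_attr(attr_doc, "firstName")
attr_average <- mean(as.numeric(xml_attrs(xml_find_first(attr_doc, "./grades"))))
```

The two encodings carry the same information; only the accessor functions differ.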
Following Wickham (2014):
Tidy data. Source: Wickham and Grolemund (2017), licensed under the Creative Commons Attribution-Share Alike 3.0 United States license.
Not tidy 💩
##       measure Jan.1 Jan.2 Jan.3
## 1 Temperature    20    22    21
## 2    Humidity    80    78    82
Tidy 😎
## # A tibble: 3 × 3
##   Date  Temperature Humidity
##   <chr>       <dbl>    <dbl>
## 1 Jan.1          20       80
## 2 Jan.2          22       78
## 3 Jan.3          21       82
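A minimal sketch of how the untidy table can be turned into the tidy one, assuming the `tidyr` package. The object names are illustrative and the slides may have used a different route.

```r
library(tidyr)

# The untidy table: one row per measure, one column per date
weather_untidy <- data.frame(
  measure = c("Temperature", "Humidity"),
  Jan.1 = c(20, 80),
  Jan.2 = c(22, 78),
  Jan.3 = c(21, 82)
)

# Step 1: gather the date columns into a single "Date" column
weather_long <- pivot_longer(
  weather_untidy,
  cols = -measure,
  names_to = "Date",
  values_to = "value"
)

# Step 2: spread the measures into their own columns, one row per date
weather_tidy <- pivot_wider(
  weather_long,
  names_from = measure,
  values_from = value
)
```

After the two steps each row is one observation (a date) and each column is one variable.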
Not tidy:
##   temperature_location
## 1           22C_London
## 2            18C_Paris
## 3             25C_Rome
Tidy:
homework..
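As a starting point for this case, here is one possible approach, assuming `tidyr`. The column names `temperature_c` and `location` are illustrative choices, not given in the exercise.

```r
library(tidyr)

# The untidy table: two variables packed into one column
city_temps <- data.frame(
  temperature_location = c("22C_London", "18C_Paris", "25C_Rome")
)

# Split the column at the underscore into two variables
city_temps_tidy <- separate(
  city_temps,
  col = temperature_location,
  into = c("temperature_c", "location"),
  sep = "_"
)

# Drop the trailing unit letter and store the temperature as a number
city_temps_tidy$temperature_c <- as.numeric(sub("C$", "", city_temps_tidy$temperature_c))
```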
Not tidy:
##    Student Econ DataHandling Management
## 1 Johannes 5.00          4.0        5.5
## 2   Hannah 5.25          4.5        6.0
## 3     Igor 4.00          5.0        6.0
Tidy:
homework..
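One common reading of this case is that the course names are values of a variable rather than variables themselves. Under that assumption, a sketch with `tidyr::pivot_longer()` looks as follows; the new column names `Course` and `Grade` are illustrative.

```r
library(tidyr)

# The untidy table: one column per course
grades_wide <- data.frame(
  Student = c("Johannes", "Hannah", "Igor"),
  Econ = c(5, 5.25, 4),
  DataHandling = c(4, 4.5, 5),
  Management = c(5.5, 6, 6)
)

# One row per student-course pair
grades_long <- pivot_longer(
  grades_wide,
  cols = -Student,
  names_to = "Course",
  values_to = "Grade"
)
```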
rbind() in base R
bind_rows() from dplyr
For these reasons (plus performance, handling of row names, and handling of factors), dplyr::bind_rows() is preferred.
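The sketch below illustrates one commonly cited difference between the two functions, namely how they treat data frames whose columns do not match. The example data frames are made up for illustration.

```r
library(dplyr)

# Two hypothetical data frames with only partly matching columns
grades_a <- data.frame(id = c(1, 2), micro = c(5, 4.5))
grades_b <- data.frame(id = 3, macro = 4)

# Base R: rbind() stops with an error when the column names differ;
# tryCatch() captures the error message instead of aborting the script
rbind_outcome <- tryCatch(
  rbind(grades_a, grades_b),
  error = function(e) conditionMessage(e)
)

# dplyr: bind_rows() matches columns by name and fills the gaps with NA
combined <- bind_rows(grades_a, grades_b)
```

Here `rbind_outcome` holds an error message, while `combined` has three rows and the columns `id`, `micro`, and `macro`.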
Long and wide data. Source: Hugo Tavares
Long and wide data with code. Source: Hugo Tavares
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. Sebastopol, CA: O’Reilly. http://r4ds.had.co.nz/.